llm-based agent
Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis
Watanabe, Chihiro, Sun, Jingyu
Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)
- Asia > China > Shanghai > Shanghai (0.04)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- North America > United States > New York > Richmond County > New York City (0.04)
- (6 more...)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- North America > United States (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Game Theory (0.93)
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct.
Using LLMs in Generating Design Rationale for Software Architecture Decisions
Zhou, Xiyu, Li, Ruiyin, Liang, Peng, Zhang, Beiqi, Shahin, Mojtaba, Li, Zengyang, Yang, Chen
Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies, including zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR with the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the arguments of DR not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. To further understand the trustworthiness and applicability of LLM-generated DR in practice, we conducted semi-structured interviews with six practitioners. Based on the experimental and interview results, we discussed the pros and cons of the three prompting strategies, the strengths and limitations of LLM-generated DR, and the implications for the practical use of LLM-generated DR.
- Asia > China > Hubei Province > Wuhan (0.41)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Oceania > Australia (0.04)
- (2 more...)
MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications
Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy > Lazio > Rome (0.04)
- Information Technology > Security & Privacy (0.93)
- Health & Medicine (0.71)
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
Smith, Chandler, Abdulhai, Marwa, Diaz, Manfred, Tesic, Marko, Trivedi, Rakshit S., Vezhnevets, Alexander Sasha, Hammond, Lewis, Clifton, Jesse, Chang, Minsuk, Duéñez-Guzmán, Edgar A., Agapiou, John P., Matyas, Jayd, Karmon, Danny, Kundu, Akash, Korshuk, Aliaksei, Ananya, Ananya, Rahman, Arrasy, Kulandaivel, Avinaash Anand, McHale, Bain, Zhang, Beining, Alexander, Buyantuev, Rojas, Carlos Saith Rodriguez, Wang, Caroline, Talele, Chetan, Liu, Chenao, Lin, Chichen, Riazi, Diana, Shi, Di Yang, Tewolde, Emanuel, Tennant, Elizaveta, Zhong, Fangwei, Cui, Fuyang, Zhao, Gang, Piqueras, Gema Parreño, Yun, Hyeonggeun, Makarov, Ilya, Cui, Jiaxun, Purbey, Jebish, Dilkes, Jim, Nguyen, Jord, Xiao, Lingyun, Giraldo, Luis Felipe, Chacon-Chamorro, Manuela, Beltran, Manuel Sebastian Rios, Segura, Marta Emili García, Wang, Mengmeng, Alim, Mogtaba, Quijano, Nicanor, Schiavone, Nico, Macmillan-Scott, Olivia, Peña, Oswaldo, Stone, Peter, Kadiyala, Ram Mohan Rao, Fernandez, Rolando, Manrique, Ruben, Lu, Sunjia, McIlraith, Sheila A., Dhuri, Shamika, Shi, Shuqing, Gupta, Siddhant, Sarangi, Sneheel, Subramanian, Sriram Ganapathi, Cha, Taehun, Klassen, Toryn Q., Tu, Wenming, Fan, Weijian, Ruiyang, Wu, Feng, Xue, Du, Yali, Liu, Yang, Wang, Yiding, Kang, Yipeng, Sung, Yoonchang, Chen, Yuxuan, Zhang, Zhaowei, Wang, Zhihan, Wu, Zhiqiang, Chen, Ziang, Zheng, Zilong, Jia, Zixia, Wang, Ziyan, Hadfield-Menell, Dylan, Jaques, Natasha, Baarslag, Tim, Hernandez-Orallo, Jose, Leibo, Joel Z.
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > China > Beijing > Beijing (0.04)
- (7 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.87)
Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation
Zhang, Ke, Zhao, Xiaoning, Zheng, Ce, Ning, Jiahong, Zhu, Dandan, Zhang, Wenqi, Sun, Chen, Sugawara, Toshiharu
This study proposes Tool-RoCo, a novel benchmark for evaluating large language models (LLMs) in long-term multi-agent cooperation based on RoCo, a multi-robot cooperative benchmark. Recent research on LLM-based multi-agent systems has relied on predefined orchestration, while ignoring agent autonomy. Tool-RoCo treats other agents as tools and introduces cooperative tools, leveraging tool usage to evaluate multi-agent cooperation and self-organization. Tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different autonomy levels, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self-organization, where a central LLM autonomously activates agents while keeping others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self-organization, where a randomly chosen initial agent can request collaboration, activating additional agents via tool calls. Tool-RoCo includes three multi-robot tasks, SORT, P ACK, and CABINET, to measure format and parameter accuracy and agent coordination through tool usage. The results using several LLMs showed that cooperative tools accounted for only 7.09% of all tools, indicating that LLM-based agents rarely invoked others as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to maintain active agents while seldom deactivating them for adaptive coordination. Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi-agent tasks.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)
Large Language Model-based Data Science Agent: A Survey
Chen, Ke, Wang, Peiran, Yu, Yaoning, Zhan, Xianyang, Wang, Haohan
The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify key processes for LLM-based agents, including data preprocessing, model development, evaluation, visualization, etc. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLMbased agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with the practical workflows in data science.
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > China > Jiangsu Province > Yancheng (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > Middle East > Jordan (0.04)
- Workflow (1.00)
- Overview (1.00)
- Research Report > Experimental Study (0.92)
- Health & Medicine (1.00)
- Information Technology (0.67)